Qi C R, Liu W, Wu C, et al. Frustum PointNets for 3D Object Detection from RGB-D Data[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 918-927.
1. Overview
- Previous methods focus on images or 3D voxels
- Treat RGB-D data as 2D maps for CNN
- Learning in 3D space can better exploit the geometric and topological structure of 3D data and apply 3D transformations
This paper proposes Frustum PointNets, which
- operate on raw point clouds produced by RGB-D scans
- leverage both 2D detection and 3D object localization
- key challenge. efficiently propose possible locations of 3D objects in 3D space
pipeline. 2D proposal → frustum proposal → 3D instance segmentation → 3D box estimation
coordinate normalization
1.1. Related Works
- Front View Image Based Methods
represent depth data as 2D maps
- Bird's Eye View Based Methods
MV3D. projects the LiDAR point cloud to a bird's-eye view and trains an RPN for 3D bounding box proposals
- 3D Based Methods
- Deep Learning on Point Clouds
1.2. Problem Definitions
- depth data. obtained from LiDAR or indoor depth sensors and represented as a point cloud
- the projection matrix is known. can get a 3D frustum from a 2D image region
- 3D box is parameterized by size (h, w, l), center (c_x, c_y, c_z) and orientation (Θ, φ, ψ).
only consider heading angle Θ in this paper.
1.3. Dataset
- KITTI (outdoor). RGB + LiDAR point cloud (sparse due to distance)
- SUN-RGBD (indoor). RGB-D (dense)
the framework generalizes to both sparse and dense point clouds.
2. Frustum PointNets
2.1. Frustum Proposal
The resolution of data produced by most 3D sensors (especially real-time depth sensors) is still lower than that of RGB images from commodity cameras.
- 2D RGB detector. Fast R-CNN, FPN, focal loss
- with a known camera projection matrix, a 2D box can be lifted to a frustum
- rotate so that the center axis of the frustum is orthogonal to the image plane
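A minimal numpy sketch of the frustum proposal step, assuming the points are already in the rectified camera frame and `P` is the 3x4 camera projection matrix (function names and conventions are illustrative, not the authors' code):

```python
import numpy as np

def extract_frustum_points(points_cam, P, box2d):
    """Keep the points whose image projection falls inside the 2D detection box."""
    pts_h = np.hstack([points_cam, np.ones((points_cam.shape[0], 1))])  # homogeneous coords
    uvw = pts_h @ P.T
    u, v = uvw[:, 0] / uvw[:, 2], uvw[:, 1] / uvw[:, 2]
    xmin, ymin, xmax, ymax = box2d
    in_box = (u >= xmin) & (u <= xmax) & (v >= ymin) & (v <= ymax) & (points_cam[:, 2] > 0)
    return points_cam[in_box]

def rotate_to_frustum_frame(points):
    """Rotate about the camera's up (Y) axis so the frustum's center axis aligns
    with +Z, i.e. becomes orthogonal to the image plane. Here the axis is
    approximated by the mean direction of the frustum points; the paper uses the
    ray through the 2D box center."""
    center = points.mean(axis=0)
    angle = np.arctan2(center[0], center[2])            # azimuth of the frustum axis
    c, s = np.cos(-angle), np.sin(-angle)
    R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    return points @ R.T
```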
2.2. 3D Instance Segmentation
2.2.1. V1 PointNet
2.2.2. V2 PointNet++
- directly regressing the 3D object location from a depth map with a 2D CNN is hard because of occluding objects and background clutter
segmentation (point-level binary classification) in the 3D point cloud is much more natural
leverage the semantics from 2D detector (one-hot class vector)
the segmentation network can use this prior to find geometries of that category.
- coordinate normalization. transform the point cloud by subtracting the XYZ values of the masked points' centroid
- mask the input frustum
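A rough numpy sketch of the class prior and the mask-coordinate normalization described above (hypothetical helpers, not the authors' implementation):

```python
import numpy as np

def add_class_prior(points, class_id, num_classes):
    """Tile the one-hot class vector from the 2D detector and concatenate it to
    every point, so the segmentation net knows which category's geometry to expect."""
    onehot = np.zeros(num_classes)
    onehot[class_id] = 1.0
    return np.hstack([points, np.tile(onehot, (points.shape[0], 1))])

def mask_and_normalize(frustum_pts, seg_logits):
    """Keep points predicted as the object and re-center them at their centroid
    (the mask coordinate frame); the centroid is kept so the absolute box center
    can be recovered later."""
    mask = seg_logits[:, 1] > seg_logits[:, 0]   # object vs. clutter, per point
    object_pts = frustum_pts[mask] if mask.any() else frustum_pts
    centroid = object_pts.mean(axis=0)
    return object_pts - centroid, centroid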
2.3. Amodal 3D Box Estimation
2.3.1. T-Net
- the origin of the mask coordinate frame may be far from the amodal box center
- STN (no direct supervision) vs. T-Net (explicitly supervised)
2.3.2. Box Estimation PointNet
V1 PointNet
V2 PointNet++
box center residual prediction. combined with the previous center residual from the T-Net and the masked points’ centroid to recover an absolute center
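i.e. the predicted absolute center is recovered by chaining the residuals, as in the paper:

$$C_{pred} = C_{mask} + \Delta C_{T\text{-}Net} + \Delta C_{box\text{-}net}$$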
NS pre-defined size templates (3 dims each: height, width, length) and NH equally split heading angle (Θ) bins; the network outputs NS scores for size and NH scores for heading, plus residuals
- output dimension. 3 (center) + 4×NS (1 score + 3 size residuals per template) + 2×NH (1 score + 1 heading residual per bin)
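For example, with NS = 8 size templates and NH = 12 heading bins (the KITTI setting reported in the paper), the output vector has 3 + 4×8 + 2×12 = 59 dimensions. A small sketch of how such a vector could be split; the exact ordering is an assumption for illustration:

```python
import numpy as np

NS, NH = 8, 12                       # size templates and heading bins
OUT_DIM = 3 + 4 * NS + 2 * NH        # = 59

def parse_box_output(out):
    """Split the box-estimation output into center residual, size scores/residuals
    and heading scores/residuals (illustrative layout)."""
    assert out.shape[-1] == OUT_DIM
    center_residual  = out[:3]
    size_scores      = out[3:3 + NS]
    size_residuals   = out[3 + NS:3 + 4 * NS].reshape(NS, 3)
    heading_scores   = out[3 + 4 * NS:3 + 4 * NS + NH]
    heading_residual = out[3 + 4 * NS + NH:]
    return center_residual, size_scores, size_residuals, heading_scores, heading_residual
```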
2.4. Multi-task Loss
- L_{c1-reg}. center of T-Net
- L_{c2-reg}. center of box estimation net
- L_{h-cls}, L_{h-reg}. heading angle prediction
- L_{s-cls}, L_{s-reg}. size prediction
- L_{corner}. corner loss for joint optimization of box parameters
center, size and heading have separate loss terms, but for final 3D box accuracy they should be jointly optimized → the corner loss achieves this because corner positions are jointly determined by center, size and heading.
- for each of the NS × NH anchor boxes
- only the ground-truth size/heading class contributes
- sum of the distances between the eight corners of the predicted and ground-truth boxes
To avoid a large penalty from a flipped heading estimate, also compute the corner distances to the flipped ground-truth box and use the minimum of the two.
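For reference, the multi-task loss and the corner loss take roughly the following form in the paper, where λ and γ are weighting factors, δ_ij selects the ground-truth size/heading class, and P_k^*, P_k^{**} are the corners of the ground-truth box and of its heading-flipped copy:

$$L_{multi\text{-}task} = L_{seg} + \lambda\,\big(L_{c1\text{-}reg} + L_{c2\text{-}reg} + L_{h\text{-}cls} + L_{h\text{-}reg} + L_{s\text{-}cls} + L_{s\text{-}reg} + \gamma L_{corner}\big)$$

$$L_{corner} = \sum_{i=1}^{NS}\sum_{j=1}^{NH} \delta_{ij}\,\min\Big\{\sum_{k=1}^{8}\big\|P_k^{ij} - P_k^{*}\big\|,\ \sum_{k=1}^{8}\big\|P_k^{ij} - P_k^{**}\big\|\Big\}$$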
3. Experiments
3.1. Comparison
3.2. Ablation Study
3.2.1. 2D vs 3D
- 2D masks still contain clutter and background points from the depth map, so 3D instance segmentation yields cleaner object points
3.2.2. Coordinate Normalization
- Frustum rotation and mask centroid subtraction are critical
3.2.3. Loss Function
3.2.4. PointNet Version
3.3. Failure Case
- inaccurate pose and size estimation when the point cloud is sparse (fewer than 5 points)
- multiple instances from the same category in a frustum
- the 2D detector misses objects due to dark lighting or strong occlusion